Final Project: A Model to Estimate HR Scores (2020-)
1 Introduction
1.1 Prediction Problem
Two of my three dissertation chapters rely on the use of a dependent variable measuring human rights respect. Hitherto, I have depended on Fariss’s Human Rights (HR) Scores,1 but he has not updated his dataset since 2020.2 In my previous final project, I identified a “raw” form of the physical violence index, derived from a simple average of the freedoms from political killings and torture indicators featured in the Varieties of Democracy (V-Dem) dataset, as perhaps the best available substitute for HR Scores. However, for this project, I would like to try my hand at predicting the missing data in HR Scores—namely, all country-year observations 2021-2023, the most recent full years since 2020—with the raw physical violence index as an included covariate, inter alia.
1 For the original article wherein these scores were introduced, see Fariss, 2014.
2 See his Dataverse page for evidence.
1.2 Overview
2 EDA of Target Variable
In my previous memo, I failed to provide a visualization of the distribution of HR Scores, so I do so here:
The above shows that HR Scores isn’t distributed in a perfectly symmetrical manner, featuring a slight right-skew and a degree of bimodality. Nevertheless, these features seem relatively minor, and most methods available to correct them do not apply given the nature of the variable.3
3 Namely, square-root, log, and Box-Cox transformations only work for positive variables, but approximately half of HR Scores’ values are negative.
Below, however, is evidence that I have attempted a Yeo-Johnson transformation of HR Scores:
The transformation does rectify the right-skew, if at the expense of introducing a slight left-skew. It also fails to eliminate the bimodality feature. In tandem with my present inability to compute the transformed variable’s inverse subsequent to model fitting, these considerations prompt me to proceed with HR Scores unaltered, though I welcome feedback as to whether this is the right step to take.
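On the inversion point: the Yeo-Johnson transformation has a closed-form inverse for any fixed lambda, so predictions made on the transformed scale could in principle be mapped back. Below is a minimal base-R sketch, assuming a single known lambda and no missing values; the function names are hypothetical and the code is not tied to any particular package's implementation:

```r
# Yeo-Johnson transform of a numeric vector y for a fixed lambda.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  out[pos] <- if (abs(lambda) < 1e-8) {
    log1p(y[pos])
  } else {
    ((y[pos] + 1)^lambda - 1) / lambda
  }
  out[!pos] <- if (abs(lambda - 2) < 1e-8) {
    -log1p(-y[!pos])
  } else {
    -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

# Inverse transform: maps values z on the transformed scale back to the
# original scale. The sign of z identifies which branch was used.
yeo_johnson_inverse <- function(z, lambda) {
  out <- numeric(length(z))
  pos <- z >= 0
  out[pos] <- if (abs(lambda) < 1e-8) {
    expm1(z[pos])
  } else {
    (lambda * z[pos] + 1)^(1 / lambda) - 1
  }
  out[!pos] <- if (abs(lambda - 2) < 1e-8) {
    1 - exp(-z[!pos])
  } else {
    1 - (1 - (2 - lambda) * z[!pos])^(1 / (2 - lambda))
  }
  out
}

# Round trip on a few values spanning the observed range of HR Scores.
x <- c(-3.46, -0.5, 0, 1.2, 5.34)
x_back <- yeo_johnson_inverse(yeo_johnson(x, lambda = 0.7), lambda = 0.7)
```

Model predictions that fall outside the range of the transformed training values can produce NaN in the power branches, so back-transformed predictions would still need checking.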
3 Pre-Recipe Preprocessing
Prior to writing my recipes, I complete the following noteworthy preprocessing steps:
- Merge the V-Dem, HR Scores, and Political Terror Scale (PTS) datasets. At the outset, I wanted to include PTS variables in my models, as they were explicitly noted as inputs for the original HR Scores; however, they featured such widespread missingness that imputing their values proved computationally infeasible.
- For the V-Dem dataset, recode Czechoslovakia’s pre-Velvet Divorce (1992) observations as those of the Czech Republic.4 Doing so rectifies widespread missingness in this case. I deemed this justifiable on two counts:
  - Fariss assigns the Czech Republic cowcode to Czechoslovakia.
  - V-Dem denominates both Czechoslovakia and the Czech Republic “Czechia”; there is no overlap in naming conventions or otherwise with Slovakia, which doesn’t appear in the dataset until 1993.
- Remove country-years that appear in HR Scores but are systematically absent in V-Dem—cases where the value on “N Missing” is positive as shown in Table 1, below:
4 I.e., recode cowcode == 315 (Czechoslovakia) to cowcode == 316 (Czech Republic).
  - These observations mainly involve:
    - Post-Soviet states (e.g., Armenia, Azerbaijan, etc.) and post-WWII states (e.g., West and East Germany).
    - European microstates (e.g., Liechtenstein and Monaco) and small island states (e.g., Samoa, St. Vincent and the Grenadines, etc.).
  - The former is more tolerable, for these cases stem from minor discrepancies in coding start/end dates and only pertain to a handful of years. The latter gives more reason for pause; indeed, to delete these observations is effectively to remove the world’s smallest countries from the ambit of my project entirely. Nevertheless, it doesn’t make sense, in my estimation, to retain them when there exist no data whatsoever (including the most basic of figures, such as population, GDP, etc.) on which reasonably accurate imputations might be based. I proceed from here, unfortunately having to live with this limitation to my projections, though I again welcome feedback as to whether I’ve made the right choice.
- Select as predictors the variables I identified in my previous final project as being related in theory to HR Scores, as well as the entirety of V-Dem’s high-level and medium-level indices.
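The recoding and removal steps above can be sketched in base R as follows. The toy data frames, the hr_score column, and the placeholder code 999 are hypothetical; only the 315-to-316 recode and the v2x_clphy name come from this memo, and an inner join is a simplification of the actual removal logic:

```r
# Toy stand-ins for the real datasets (values are illustrative only).
vdem <- data.frame(
  cowcode   = c(315, 315, 316, 316),
  year      = c(1991, 1992, 1993, 1994),
  v2x_clphy = c(0.90, 0.91, 0.93, 0.94)
)
hr <- data.frame(
  cowcode  = c(316, 316, 316, 316, 999, 999),  # 999: placeholder for a state absent from V-Dem
  year     = c(1991, 1992, 1993, 1994, 1991, 1992),
  hr_score = c(1.1, 1.2, 1.4, 1.5, 2.0, 2.1)
)

# Recode Czechoslovakia (315) to the Czech Republic (316) in V-Dem.
vdem$cowcode[vdem$cowcode == 315] <- 316

# An inner join keeps only country-years present in both sources,
# which drops HR Scores rows that have no V-Dem counterpart.
merged <- merge(hr, vdem, by = c("cowcode", "year"))
```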
Upon considering and completing these steps, I move forward with a preprocessed dataset of 52 variables and 10,524 observations. The plot below demonstrates that the preprocessing steps involving HR Scores do not significantly alter the distribution thereof:5
5 Admittedly, the distribution might be marginally more right-skewed; this likely owes itself to the removal of the European microstates, whose human rights levels are generally very high.
The preprocessed dataset also exhibits a low degree of missingness, as demonstrated by the plot below, meaning that the computational demands for our imputations should be rather low in turn:
4 Data Splitting Specifications
I opted to split my data 75-25 into training and testing sets. Within the training set, I used cross-validation with a 90-10 analysis/assessment split (i.e., 10 folds), repeated five times, resulting in a total of 50 folds.
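A minimal sketch of this resampling scheme, assuming the rsample package and simulated data. The seed, the toy columns, and the absence of stratification are assumptions rather than details stated here:

```r
library(rsample)

set.seed(123)
toy <- data.frame(hr_score = rnorm(400), x = rnorm(400))

# 75-25 training/testing split.
toy_split <- initial_split(toy, prop = 0.75)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# 10-fold cross-validation (90-10 analysis/assessment), repeated 5 times.
toy_folds <- vfold_cv(toy_train, v = 10, repeats = 5)
nrow(toy_folds)
```

Each row of toy_folds is one analysis/assessment pair, so ten folds times five repeats gives the 50 resamples on which the mean RMSE figures are averaged.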
5 Recipes
My null (null) and basic baseline (lm) models, analyzed in Section 6, draw on three recipes, distinguished by the number of neighbors set for KNN-imputation: 5, 10, and 20, respectively.6 I establish and test these recipe variants at this stage to appraise the extent to which the number of neighbors used for imputation affects model performance.
6 I considered bagged-tree imputation as well, but it proved computationally infeasible.
Aside from KNN-imputation, the main preprocessing components shared by each recipe-group are as follows:
- Transforming country ID (cowcode) and year to a dummy variable.
- Log-transforming population, GDP, and GDP-per-capita.7
- Normalizing all numeric predictors.
- Removing the year dummies subsequent to KNN-imputation.
7 These variables are widely-known as right-skewed, and they are indeed so in the V-Dem dataset. For evidence that I have verified this, see scripts/appendices/0_appendices.R.
8 The 2024 V-Dem Dataset, which will provide data up to 2023, is scheduled to be released on 07 March 2024.
9 Put differently, because year is a factor, there is no way to estimate the relationship between “2023,” “2024,” etc. and HR Scores when there exists wholesale missingness of HR Scores for those years.
This final step is important. As a factor, year can continue to be used as an imputation predictor for V-Dem observations, because there will almost certainly be V-Dem data for future years;8 yet it cannot be used to predict HR Scores, for that variable ends in 2019.9
In aggregate, the preprocessing steps result in a training set of 226 variables: 1 outcome, 2 ID variables,10 and 223 predictors, 179 of which are country dummies.
10 country_name and cow_year
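The shared steps can be sketched as a recipe. This is a simplified illustration on simulated data, assuming the recipes package; the column names (other than cowcode and country_name), the order of the steps, and the use of all predictors for imputation are assumptions:

```r
library(recipes)

set.seed(1)
toy <- data.frame(
  hr_score     = rnorm(60),
  country_name = rep(c("A", "B", "C"), each = 20),
  cowcode      = factor(rep(c(2, 20, 200), each = 20)),
  year         = factor(rep(2000:2019, times = 3)),
  population   = exp(rnorm(60, 15)),
  gdp          = exp(rnorm(60, 20)),
  pvi          = runif(60)               # hypothetical predictor
)
toy$gdp[c(3, 17, 42)] <- NA              # inject some missingness

rec <- recipe(hr_score ~ ., data = toy) |>
  update_role(country_name, new_role = "id") |>       # keep as ID, not predictor
  step_dummy(cowcode, year) |>                        # country and year dummies
  step_log(population, gdp) |>                        # right-skewed variables
  step_normalize(all_numeric_predictors()) |>
  step_impute_knn(all_predictors(), neighbors = 5) |> # year dummies inform imputation
  step_rm(starts_with("year_"))                       # then drop them before modeling

baked <- rec |> prep() |> bake(new_data = NULL)
names(baked)
```

Placing step_rm() last is what lets year help with imputation without ever reaching the model as a predictor.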
Also included as a baseline in Section 6 is a fourth recipe (akt_lm), which uses my unscaled Physical Violence Index (PVI), created in my previous final project, as the sole predictor of HR Scores. I do so in order to test the appropriateness of merely substituting it for HR Scores.
The ridge, lasso, and elastic-net models seen in Section 7 use the same recipe as that underpinning the 5-neighbor baselines.11 The KNN, random forest, and boosted tree models—also appearing in Section 7—rely on a similar recipe, the only difference being the inclusion of one-hot encoding for the country-ID dummy variables.12
11 See Section 6, below, for a discussion as to why I proceed with the 5-neighbor recipe for the tuning fits.
12 This means that the total number of predictors utilized by these models is 224—one more than that of the others.
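The extra predictor noted in footnote 12 follows from how one-hot encoding differs from reference-level dummy coding: a factor with k levels yields k columns rather than k - 1. A small sketch with the recipes package and a toy three-level cowcode (values hypothetical):

```r
library(recipes)

set.seed(3)
toy <- data.frame(
  y       = rnorm(9),
  cowcode = factor(rep(c("2", "20", "200"), times = 3))
)

# Reference-level coding: k - 1 dummy columns.
ref_coded <- recipe(y ~ ., data = toy) |>
  step_dummy(cowcode) |>
  prep() |>
  bake(new_data = NULL)

# One-hot coding: k dummy columns.
one_hot_coded <- recipe(y ~ ., data = toy) |>
  step_dummy(cowcode, one_hot = TRUE) |>
  prep() |>
  bake(new_data = NULL)

ncol(one_hot_coded) - ncol(ref_coded)
```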
To summarize, the recipes I deploy and their apportionment to the models can be organized as follows:
- Main:
  - 5 Neighbors: basic baseline, null, ridge, lasso, elastic net
  - 10 Neighbors: basic baseline, null
  - 20 Neighbors: basic baseline, null
- Tree:
  - 5 Neighbors: KNN, random forest, boosted tree
- Unscaled PVI
These amount to five recipes in total: three “main,” one “tree,” and one additional baseline (unscaled PVI).
6 Baseline Fits
Table 2, below, gives the performance of my baseline fits on the cross-validation folds as measured by mean RMSE:
| Workflow ID | Mean RMSE | Std. Error |
|---|---|---|
| neighbors_5_lm | 0.5651164 | 0.0023461 |
| neighbors_10_lm | 0.5653670 | 0.0023460 |
| neighbors_20_lm | 0.5657787 | 0.0023566 |
| akt_lm | 1.0097946 | 0.0042792 |
| neighbors_5_null | 1.4652267 | 0.0045097 |
| neighbors_10_null | 1.4652267 | 0.0045097 |
| neighbors_20_null | 1.4652267 | 0.0045097 |
As we can see, the null models perform far worse than the basic baseline models; the unscaled PVI baseline falls roughly midway between the two.
There are two important takeaways at this juncture, in my view. First, the basic baseline outperforms the unscaled PVI baseline, meaning we can proceed knowing that there exist better ways to estimate HR Scores than to use the unscaled PVI as a simple substitute for HR Scores. Second, changes in the number of neighbors used for KNN-imputation seem to have little effect on model performance. This is good insofar as it is no longer a concern of ours; we will proceed with the step_impute_knn() default of neighbors = 5 for all remaining workflows, though we theoretically could have selected a larger (or smaller) number with marginal impact on our findings.
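The identical RMSE across the three null workflows in Table 2 is expected: a null model for a numeric outcome predicts the training mean regardless of how the predictors were imputed. A minimal base-R sketch with simulated data (all names and values are hypothetical, with x_knn5 and x_knn20 standing in for differently imputed versions of a predictor):

```r
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

set.seed(42)
train <- data.frame(y = rnorm(300), x_knn5 = rnorm(300), x_knn20 = rnorm(300))
test  <- data.frame(y = rnorm(100), x_knn5 = rnorm(100), x_knn20 = rnorm(100))

# A null model ignores every predictor and predicts the training mean,
# so its error cannot change with the imputation settings.
null_rmse <- rmse(test$y, rep(mean(train$y), nrow(test)))

# Linear baselines, by contrast, depend on the predictors they are given.
lm_5  <- lm(y ~ x_knn5, data = train)
lm_20 <- lm(y ~ x_knn20, data = train)
rmse_5  <- rmse(test$y, predict(lm_5, newdata = test))
rmse_20 <- rmse(test$y, predict(lm_20, newdata = test))
```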
7 Tuned Fits
I subsequently tune the following hyperparameters for six models on the cross-validation folds:
- Ridge: penalty
- Lasso: penalty
- Elastic Net: penalty and mixture
- KNN: neighbors
- Random Forest: mtry and min_n
- Boosted Tree: mtry, min_n, and learn_rate
For the creation of the respective tuning grids, the first four use levels = 10, whereas the latter two use levels = 5. The random forest model further uses trees = 1000 and an mtry range set to c(1, 15), whereas the boosted tree model uses the same mtry range and a learn_rate range set to c(-3, -0.2).
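The grids described above might be built along the following lines. This is a sketch assuming the dials package with its default penalty(), mixture(), and min_n() ranges, none of which are stated in this memo. Note that dials expresses the learn_rate range on the log10 scale, so c(-3, -0.2) spans roughly 0.001 to 0.631:

```r
library(dials)

# Elastic net: 10 levels per parameter (assumed default ranges).
en_grid <- grid_regular(penalty(), mixture(), levels = 10)

# Boosted tree: 5 levels per parameter, with the stated mtry and
# learn_rate ranges; min_n() keeps its default range (an assumption).
bt_grid <- grid_regular(
  mtry(range = c(1, 15)),
  min_n(),
  learn_rate(range = c(-3, -0.2)),
  levels = 5
)

nrow(en_grid)
nrow(bt_grid)
range(bt_grid$learn_rate)
```

A regular grid grows multiplicatively with the number of tuned parameters, which is one reason the tree-based models use fewer levels per parameter.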
The processing times for these models, with parallel processing across eight cores where possible, were approximately as follows:13
13 I simply timed these with my computer’s clock.
- Ridge: 5 minutes
- Lasso: 4 minutes
- Elastic Net: 9 minutes
- KNN: 12 minutes
- Random Forest: 92 minutes
- Boosted Tree: 65 minutes
Below is Table 3, which gives the best mean RMSE of each tuned fit:
| Model | Mean RMSE | Std. Error |
|---|---|---|
| knn | 0.2429732 | 0.0028678 |
| rf | 0.2869661 | 0.0018060 |
| bt | 0.4767500 | 0.0025412 |
| en | 0.5587930 | 0.0022834 |
| lasso | 0.5589801 | 0.0022842 |
| ridge | 0.5813393 | 0.0022810 |
The optimal tuning parameters for these models as assessed by mean RMSE are, respectively:
- KNN: neighbors = 2
- Random Forest: mtry = 14, min_n = 2
- Boosted Tree: mtry = 14, min_n = 2, learn_rate = 0.631
- Elastic Net: penalty = 1e-10, mixture = 0.683
- Lasso: penalty = 1e-10
- Ridge: penalty = 1e-10
Below is also a plot depicting the confidence intervals for each estimate:
| Model | Mean RMSE | Std. Error |
|---|---|---|
| knn | 0.2405136 | 0.0030602 |
| rf | 0.3061221 | 0.0016808 |
| bt | 0.5048741 | 0.0025192 |
| en | 0.5734804 | 0.0024225 |
| lasso | 0.5734929 | 0.0024186 |
| ridge | 0.5882699 | 0.0022959 |
Table 3 and the above plot show that, while the best random-forest model has the lowest standard error, the best KNN model has the lowest mean RMSE, with a confidence interval that also sits lowest. As such, we proceed with our final fit by selecting the best KNN model.
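As a rough check on this comparison, approximate confidence intervals can be rebuilt from the means and standard errors in Table 3. The sketch below assumes 95% normal-approximation intervals, which may not match the level or method used in the plot:

```r
# Means and standard errors copied from Table 3.
tuned <- data.frame(
  model     = c("knn", "rf", "bt", "en", "lasso", "ridge"),
  mean_rmse = c(0.2429732, 0.2869661, 0.4767500, 0.5587930, 0.5589801, 0.5813393),
  std_err   = c(0.0028678, 0.0018060, 0.0025412, 0.0022834, 0.0022842, 0.0022810)
)

# Assumed: two-sided 95% interval under a normal approximation.
z <- qnorm(0.975)
tuned$lower <- tuned$mean_rmse - z * tuned$std_err
tuned$upper <- tuned$mean_rmse + z * tuned$std_err

best_mean <- tuned$model[which.min(tuned$mean_rmse)]
best_se   <- tuned$model[which.min(tuned$std_err)]

# Does the KNN interval sit entirely below every other model's interval?
knn_separated <- tuned$upper[tuned$model == "knn"] <
  min(tuned$lower[tuned$model != "knn"])
```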
8 Final Fit
Table 5, below, gives the performance metrics for the final fit applied to the testing set:
| Metric | Estimate |
|---|---|
| rmse | 0.2305683 |
| mae | 0.1225338 |
| mape | 59.8248677 |
| rsq | 0.9741917 |
The RMSE is even lower than that of the cross-validation folds (Table 4). The MAE is about half as large as the RMSE; and at approximately 0.974, the \(R^{2}\) statistic is exceedingly high. (The MAPE is about 59.8%, but this corresponds to small errors in absolute terms: most HR Scores are clustered around 0, where even small absolute errors translate into large percentage errors.)
The RMSE of approximately 0.231 represents an exceedingly marginal difference. For the full set of HR Scores, the minimum is approximately -3.46, while the maximum is approximately 5.34, for a range of approximately 8.8; the RMSE’s range,14 then, represents only about 5.24% of the full range. Recalling that HR Scores are themselves estimates with a mean standard deviation of approximately 0.33 lends further credibility to the accuracy of our model.15 Qualitatively, a score difference of roughly 0.23 does not seem to mean much either. Indeed, the examples of countries that experienced such a score change did not witness, to my knowledge, any appreciable changes in their underlying political milieus.16
14 Computed (approximately) by \(0.231*2\).
15 For the code computing this figure, see scripts/appendices/0_appendices.R.
16 For these examples, see scripts/appendices/0_appendices.R.
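For reference, the four metrics in Table 5 can be written out in a few lines of base R. The vectors below are made up purely to show why MAPE balloons when true values sit near 0 even though absolute errors stay small; rsq is taken here as the squared correlation, which is an assumption about how the reported figure was computed:

```r
rmse <- function(truth, est) sqrt(mean((truth - est)^2))
mae  <- function(truth, est) mean(abs(truth - est))
mape <- function(truth, est) mean(abs((truth - est) / truth)) * 100
rsq  <- function(truth, est) cor(truth, est)^2

# Hypothetical scores, several of them close to 0, each missed by 0.1.
truth <- c(-2.0, -0.05, 0.02, 0.1, 1.5, 3.0)
est   <- truth + 0.1

metrics <- c(
  rmse = rmse(truth, est),
  mae  = mae(truth, est),
  mape = mape(truth, est),
  rsq  = rsq(truth, est)
)
metrics
```

Here the absolute error is the same small amount for every observation, yet the percentage error is dominated by the handful of values nearest 0.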
To round off this discussion is the below scatterplot, which depicts the relationship between our predictions and actual values in the testing set, illustrating the tight fit—and hence high degree of accuracy—of our final model:
9 Remaining Considerations
At this stage, I feel as though my final project is near to completion. Aside from the issues I’ve already identified as meriting potential feedback—namely, Yeo-Johnson transformation of the outcome (Section 2) and the removal of certain countries from the dataset (Section 3)—I welcome advice as to how to think about, or tackle if necessary, the issue of multicollinearity.
Needless to say, four of my five recipes are effectively “kitchen sink” in kind (the exception being the unscaled PVI recipe). The V-Dem variables capture interrelated phenomena (e.g., elections and civil liberties) and occasionally comprise additive measures as well as their inputs (e.g., PVI and its components, the political killings and torture metrics). They also contain variants of the same variable set to different scales (e.g., the three-point, four-point, and five-point versions of the PVI). A particularly concerning example of this, to me, is the inclusion of both the unscaled PVI and the original (scaled) PVI, the latter simply being the former set to a 0-to-1 scale. The two are hence tightly related, namely through a logistic function, as demonstrated in the plot below:17
17 My work in this section can be found in scripts/appendices/.
I opted to include all such variants, for I didn’t want to presuppose that one would be “better” in predicting HR Scores than the other(s). I’m also of the understanding that multicollinearity may be less of a concern for prediction problems such as ours, where the primary objective is simply to produce the most accurate estimates possible—in contrast to inference problems, where robust coefficients and p-values are the gold standard. Nonetheless, I remain worried about the impact of multicollinearity on the strength of my predictions.
To this end, I created and briefly tested two sets of new recipes from my original set of “main” recipes (Section 5), both of which are multicollinearity averse. The first is less averse, simply removing the scaled PVI18 as a predictor from each main recipe.19 The second is more averse, removing not only said variable, but also all other variable-variants as predictors.20
18 v2x_clphy
19 I do, however, allow the scaled PVI to be present for the KNN-imputation step. This is also true for all other variable-variants in the more-averse set of recipes.
20 These variables are, namely, the PVI Ordinal (e_v2x_clphy *_3C, *_4C, *_5C), the Equality before the Law and Individual Liberty Index Ordinal (e_v2xcl_rol *_3C, *_4C, *_5C), and the Civil Liberties Index Ordinal (e_v2x_civlib *_3C, *_4C, *_5C).
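One way to express the more-averse variant is to drop the overlapping variables only after imputation, so that they still inform the KNN step (footnote 19). Below is a sketch on simulated data, assuming the recipes package; the ordinal column names follow the pattern in footnote 20 but are assumptions, as is pvi_raw:

```r
library(recipes)

set.seed(2)
n <- 50
toy <- data.frame(
  hr_score       = rnorm(n),
  pvi_raw        = rnorm(n),   # hypothetical name for the unscaled PVI
  v2x_clphy      = runif(n),   # scaled PVI
  e_v2x_clphy_3C = sample(c(0, 0.5, 1), n, replace = TRUE),
  e_v2x_clphy_4C = sample(c(0, 0.33, 0.67, 1), n, replace = TRUE),
  gdp            = rnorm(n)
)
toy$gdp[c(5, 25)] <- NA

averse_rec <- recipe(hr_score ~ ., data = toy) |>
  # Variants are still available as imputation predictors here...
  step_impute_knn(all_predictors(), neighbors = 5) |>
  # ...and are removed only afterwards, before the model sees the data.
  step_rm(v2x_clphy, matches("_[345]C$"))

baked <- averse_rec |> prep() |> bake(new_data = NULL)
names(baked)
```

The less-averse variant would keep the same structure and list only v2x_clphy in step_rm().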
For now, I tested these recipes on the baseline models, exclusively. Therefore, the recipes I deployed and their apportionment to the models can be organized as follows:
- Least Averse:
  - 5 Neighbors: basic baseline, null
  - 10 Neighbors: basic baseline, null
  - 20 Neighbors: basic baseline, null
- Most Averse:
  - 5 Neighbors: basic baseline, null
  - 10 Neighbors: basic baseline, null
  - 20 Neighbors: basic baseline, null
These amount to six recipes in total: three that are less multicollinearity averse, and three that are more multicollinearity averse.
The performance of these fits on the cross-validation folds, as measured by mean RMSE, is given in Table 6 and Table 7, below:
| Workflow ID | Mean RMSE | Std. Error |
|---|---|---|
| neighbors_5_lm | 0.5667459 | 0.0023647 |
| neighbors_10_lm | 0.5670072 | 0.0023641 |
| neighbors_20_lm | 0.5674522 | 0.0023732 |
| neighbors_5_null | 1.4652267 | 0.0045097 |
| neighbors_10_null | 1.4652267 | 0.0045097 |
| neighbors_20_null | 1.4652267 | 0.0045097 |
| Workflow ID | Mean RMSE | Std. Error |
|---|---|---|
| neighbors_5_lm | 0.5728643 | 0.0024738 |
| neighbors_10_lm | 0.5731172 | 0.0024712 |
| neighbors_20_lm | 0.5735224 | 0.0024798 |
| neighbors_5_null | 1.4652267 | 0.0045097 |
| neighbors_10_null | 1.4652267 | 0.0045097 |
| neighbors_20_null | 1.4652267 | 0.0045097 |
The basic baseline (*_lm) results seemingly signify a deterioration in quality: vis-à-vis the metrics seen in Table 2, the models perform worse, with gradual yet consistent increases in both mean RMSE and standard error as they become more multicollinearity averse.
If predictive accuracy is the chief objective, then these results suggest that I should not, in fact, proceed with the multicollinearity-averse recipes—irrespective of whether I ultimately use the values produced from my model for inferential tasks. Nevertheless, I am open to feedback that would confirm or challenge my interpretation of or decision-making around this topic.